Efficient Data Representations for Information Retrieval

Author

  • J. Shane Culpepper
Abstract

The key role compression plays in efficient information retrieval systems has been recognized for some time. However, applying a traditional compression algorithm to the contents of an information retrieval system is often not the best solution. For example, it is inefficient to perform search operations in maximally compressed data or to find the intersection of maximally compressed sets. In order to perform these operations, the data representation must be fully decompressed. This thesis explores practical space versus time trade-offs which balance storage space against the competing requirement that operations be performed quickly. In particular, we are interested in variable length coding methods which are both practical and allow codeword boundaries to be found directly in the compressed representation. The latter property allows considerable flexibility in developing algorithms which can manipulate compact sets and sequences and allow selective decompression. Applications of such coding methods are plentiful. For instance, variations of this theme provide practical solutions to the compressed pattern matching problem. They are also a vital element in many of the compact dictionary representations recently proposed. In this work, we propose new data representations which allow key query operations to be performed directly on compressed data. This thesis draws together previous work and shows the fundamental importance of coding methods in which control information is built directly into the representation. 
More particularly, this thesis (a) reviews current applications of compression in string searching algorithms; (b) critically evaluates existing coding approaches which allow fast codeword length identification; (c) introduces a new coding method which shares this property; (d) investigates applications of these coding approaches to searching and seeking in compact sequences; and (e) explores new data representations which provide a compromise between good compression and fast querying in compact sets.
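To make the idea of codeword boundaries that are visible in the compressed data concrete, the following is a minimal sketch using a byte-aligned variable-length code with a stopper bit. This is a well-known illustrative scheme and is not claimed to be the coding method introduced in the thesis; the function names (`vbyte_encode`, `encode_sequence`, `seek`, `decode_at`) are hypothetical. Each codeword ends at the first byte whose high bit is set, so a reader can skip to the k-th codeword and decode only that one.

```python
def vbyte_encode(n: int) -> bytes:
    """Encode a non-negative integer, 7 bits per byte, low-order bits first.

    The high bit is the control information: it is set only on the
    final (stopper) byte of each codeword.
    """
    if n < 0:
        raise ValueError("only non-negative integers are supported")
    out = bytearray()
    while n >= 128:
        out.append(n & 0x7F)  # continuation byte: high bit clear
        n >>= 7
    out.append(n | 0x80)      # stopper byte: high bit set
    return bytes(out)


def encode_sequence(values) -> bytes:
    """Concatenate the codewords of all values into one compact byte string."""
    return b"".join(vbyte_encode(v) for v in values)


def seek(data: bytes, k: int) -> int:
    """Return the byte offset of the k-th codeword (0-based).

    Only the high bit of each byte is inspected; no value is decoded.
    """
    pos = 0
    for _ in range(k):
        while data[pos] < 0x80:  # skip continuation bytes
            pos += 1
        pos += 1                 # step past the stopper byte
    return pos


def decode_at(data: bytes, pos: int) -> int:
    """Decode the single codeword that starts at byte offset pos."""
    value = 0
    shift = 0
    while True:
        b = data[pos]
        value |= (b & 0x7F) << shift
        if b >= 0x80:            # stopper byte reached
            return value
        shift += 7
        pos += 1


values = [5, 300, 70000, 0, 127, 128]
data = encode_sequence(values)
third = decode_at(data, seek(data, 2))  # selectively decode one element
```

Here `seek` is still a linear scan over bytes, but it never reconstructs the skipped values; it only illustrates the kind of selective decompression the abstract describes.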

Similar resources

Developing a BIM-based Spatial Ontology for Semantic Querying of 3D Property Information

With the growing dominance of complex and multi-level urban structures, current cadastral systems, which are often developed based on 2D representations, are not capable of providing unambiguous spatial information about urban properties. Therefore, the concept of 3D cadastre is proposed to support 3D digital representation of land and properties and facilitate the communication of legal owners...

An Effective Path-aware Approach for Keyword Search over Data Graphs

Keyword search is a user-friendly alternative to structured languages for retrieving information from graph-structured data. Efficiently retrieving relevant answers to a keyword query and effectively ranking these answers according to their relevance are the two main challenges in keyword search over graph-structured data. In this paper, a novel scoring function is proposed, w...

Efficient Content-Based Information Retrieval: A New Similarity Measure for Multimedia Data

Content-based information retrieval of multimedia data is a great and attractive challenge which raises numerous research activities. As multimedia data become ubiquitous in our daily lives, information retrieval systems have to adapt their retrieval performance to different situations in order to efficiently satisfy the users’ information needs anytime and anywhere. To enhance the content-base...

Context-based Information seeking behavior among students of Kharazmi University

Background and Aim: The present study surveys the contextualized information retrieval behavior of students at Kharazmi University. Methods: This is descriptive applied research. The statistical population includes all students studying at Kharazmi University at the time of the research. The research sample includes 196 students selected by convenience sampling...

Image Classification via Sparse Representation and Subspace Alignment

Image representation is a crucial problem in image processing, where there exist many low-level representations of an image, e.g., SIFT, HOG and so on. But there is a missing link between low-level and high-level semantic representations. In fact, traditional machine learning approaches, e.g., non-negative matrix factorization, sparse representation and principal component analysis are employed to d...

Attribute-based Access Control for Cloud-based Electronic Health Record (EHR) Systems

An electronic health record (EHR) system facilitates the integration of patients' medical information and improves service productivity. However, giving users access to patient data in a privacy-preserving manner is still a challenging problem. Many studies are concerned with security and privacy in EHR systems. Rezaeibagha and Mu [1] have proposed a hybrid architecture for privacy-preserving accessing patient records...

Publication date: 2007